Using XGBoost in Mental Disorder Classification

Fall 2025

Author

Aabiya Mansoor, Abigail Penza Jackson, Corina Rich, Madelyn Champion (Advisor: Dr. Cohen)

Published

December 8, 2025

Slides: slides.html

Introduction

eXtreme Gradient Boosting (XGBoost) is a state-of-the-art, supervised machine learning algorithm renowned for its high performance, speed, and scalability (Zhang, Jia, and Shang 2022). Developed as an optimized implementation of the gradient boosting framework, XGBoost builds a powerful predictive model by sequentially creating an ensemble of weak decision trees, where each new tree is trained to correct the residual errors of the previous ones (Zhang, Jia, and Shang 2022). Its importance in the data science community stems from several key architectural advantages, including a sparsity-aware algorithm for handling missing or zero-value data, parallel processing capabilities for faster computation, and built-in L1 and L2 regularization to prevent model overfitting. The algorithm’s versatility and effectiveness have been demonstrated across a wide range of domains, from achieving high accuracy in financial credit scoring (Li et al. 2020) to enhancing learner performance prediction in education (Hakkal and Ait Lahcen 2024).

This research applies the power of XGBoost to address the significant and growing challenge of mental disorder classification. The early and accurate identification of mental health conditions is crucial for effective intervention, yet traditional diagnostic methods can be subjective, resource intensive, and inaccessible to many. Structured clinical and survey data provide an opportunity to apply computational methods to find objective patterns that may aid in this diagnostic process. The objective of this study is to develop and evaluate a robust XGBoost model for the multi-class classification of individuals into one of four categories: Bipolar I Disorder, Bipolar II Disorder, Major Depressive Disorder, and Normal (Chahar, Dubey, and Narang 2024).

The selection of XGBoost for this classification problem, which involves 120 patients described by 17 symptom variables, is strategic. The dataset consists of tabular survey responses, where the relationships between different answers (features) and the final diagnosis are complex and non-linear. XGBoost excels at identifying these intricate patterns in structured data. Its ability to perform feature importance ranking will also help in understanding which survey questions are most predictive of a specific mental health condition. Furthermore, XGBoost’s built-in regularization helps prevent the model from overfitting to the training data, which supports generalization to new, unseen patient cases. These features make XGBoost a suitable framework for the nuanced task of classifying mental health disorders based on categorical survey data.

Literature Review

eXtreme Gradient Boosting (XGBoost) has solidified its reputation as a high-performance, scalable, and adaptable machine learning algorithm widely applied across disciplines such as healthcare, education, public health, finance, and engineering. Initially introduced by Chen and Guestrin (Chen and Guestrin 2016), XGBoost builds its predictive strength by constructing an ensemble of weak learners, typically decision trees, where each subsequent model corrects the errors of its predecessor. Its unique architectural advantages, including a sparsity-aware algorithm, weighted quantile sketch, parallelized learning, and built-in L1/L2 regularization, contribute to its speed and accuracy, particularly when handling missing values and high-dimensional data. These strengths have made XGBoost especially effective in real-world data challenges that involve class imbalance, heterogeneous data types, or non-linear relationships.

In the healthcare domain, XGBoost has been instrumental in diagnosing and predicting diseases with enhanced accuracy and speed. For example, Liew et al. (Liew, Hameed, and Clos 2021) used a hybrid model combining deep learning and XGBoost for breast cancer classification based on histopathological images, demonstrating its ability to detect and differentiate cancer types with high reliability. Similarly, Sharma and Verbeke (Sharma and Verbeke 2020) leveraged XGBoost with biomarker data to improve depression diagnoses in a large Dutch population dataset. Their findings highlighted the importance of resampling techniques to counter data imbalance, as oversampled models achieved precision and recall scores above 0.90. Xu et al. (Xu et al. 2024) applied XGBoost to multi-modal datasets to predict self-harm in young adults, achieving a balanced accuracy of 0.800, and identifying non-linear predictors such as suicidal ideation and NTRK2 gene variants. Chahar et al. (Chahar, Dubey, and Narang 2024) further extended the model’s mental health applications by using a hybrid XGBoost–Hippopotamus Optimization Algorithm (XGBoost-HOA) to classify depression, anxiety, and stress, with SMOTE for resampling, achieving accuracies exceeding 77% across all categories.

In studies involving imbalanced data, XGBoost’s limitations are addressed through optimization techniques. Zhang et al. (Zhang, Jia, and Shang 2022) examined its application in cloud based sensor networks and demonstrated how performance improved significantly when Bayesian optimization and mixed sampling techniques were used. Their approach, evaluated using G-mean and AUC metrics, outperformed baseline XGBoost models. Likewise, Sharma and Verbeke (Sharma and Verbeke 2020) found that without balancing the class distribution, XGBoost performed poorly, but once sampling strategies were applied, performance metrics improved dramatically. These findings illustrate the necessity of preprocessing and hyperparameter tuning for extracting XGBoost’s full potential in imbalanced datasets.

XGBoost has also been instrumental in mental health classification studies using small or imbalanced datasets. Zhu, Shen, and Zhang (Zhang, Jia, and Shang 2022) applied dual XGBoost models to distinguish between deficit and non-deficit schizophrenia subtypes using fMRI features, reaching an average accuracy of 73.89%. Saleh et al. (Saleh et al. 2024) compared XGBoost with linear regression in predicting depression among refugee children, finding that XGBoost offered a richer, non-linear understanding of contributing factors such as sleep quality and stress.

Beyond healthcare, XGBoost has been applied in academic prediction models. Hu and Song (Hu and Song 2019) used the algorithm to forecast semester grades of college students based on previous academic performance. Although the accuracy reached only 55%, the study praised XGBoost for its low resource consumption and computational efficiency, making it a suitable choice for educational data mining. Hakkal and Lahcen (Hakkal and Ait Lahcen 2024) further demonstrated the model’s superiority over traditional logistic regression models in predicting learner performance, reporting an increase in AUC scores from 0.690 to 0.709. Su et al. (Su et al. 2023) showed even more compelling results in online education platforms, where XGBoost achieved an AUC score of 0.9855 while remaining more time efficient in training than deep learning models.

In the realm of public health and epidemiology, Fang et al. (Fang et al. 2022) used XGBoost to predict daily COVID-19 cases in the United States. The study found that XGBoost significantly outperformed traditional ARIMA models in time series forecasting, although the authors noted limitations when the model was applied outside the initial geographic context or when data reporting declined. Similarly, a hybrid model integrating XGBoost, Random Forest, and Antlion Optimization was deployed in India to predict infectious disease outbreaks. With over 21,000 samples, the model achieved over 96% accuracy, but the results were constrained by regional data limitations (Sivakumar and Elangovan 2023).

Applications in the financial sector also highlight XGBoost’s utility. Li et al. (Li et al. 2020) demonstrated its superiority over logistic regression in predicting loan defaults using the Lending Club dataset. The model’s ability to rank feature importance and prevent overfitting using regularization made it a top performer in credit risk prediction tasks. Similarly, Fomunyam (Fomunyam 2023) applied XGBoost to forecast volatility in the U.S. stock market, identifying the Economic Policy Uncertainty Index as a critical predictor. While the model achieved modest accuracy (55.62%) and a low MCC (0.2793), it proved useful in highlighting economic indicators driving market fear.

In pharmaceutical research, Wiens et al. (Wiens et al. 2025) presented a comprehensive tutorial and use case for applying XGBoost in drug development. The study used the algorithm to predict disease risk progression, showcasing its effectiveness in clinical decision making contexts. The model’s capacity to handle missing data, rank feature importance, and outperform traditional models in clinical trial data made it a valuable tool in the biomedical pipeline.

Further applications of XGBoost include educational diagnostics and sports analytics. For instance, Su et al. (Su et al. 2023) and Hakkal and Lahcen (Hakkal and Ait Lahcen 2024) showed how XGBoost enhanced learner performance prediction and knowledge tracing on platforms like ASSIST09 and Algebra08. Nikolaidis et al. (Nikolaidis, Knechtle, and co-authors 2023) used XGBoost to predict ultramarathon running speeds based on demographics and environmental factors, finding that country of origin and road surface type were strong predictors. In construction, Ren et al. (Ren et al. 2023) utilized XGBoost to predict the compressive strength of ultra-high-performance concrete, attaining 95.6% accuracy. However, initial overfitting required hyperparameter optimization, and the lack of data normalization was cited as a limitation.

Altogether, these findings reinforce XGBoost’s versatility and robustness in handling high dimensional, imbalanced, and heterogeneous data. While performance can be compromised without proper preprocessing or tuning, the algorithm consistently outperforms traditional statistical models when paired with sampling strategies, optimization algorithms, and domain specific enhancements. Its strength lies in its balance of computational efficiency, predictive power, and interpretability, making it a benchmark tool in the expanding landscape of applied machine learning.

Methods

eXtreme Gradient Boosting (XGBoost) Model

The eXtreme Gradient Boosting (XGBoost) algorithm is a powerful ensemble learning method based on gradient-boosted decision trees. It builds an additive model in a forward stage-wise manner, where each new tree is trained to predict the residual errors of the ensemble built thus far. This iterative process allows the model to capture complex, non-linear relationships and interactions between features—in this case, across the 17 variables in the dataset.

XGBoost optimizes a regularized objective function that balances model accuracy with complexity, thereby reducing the risk of overfitting. The general form of the objective function is:

\[ \text{Obj} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \] (Zhang, Jia, and Shang 2022)

Where:

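  • The loss function measures the difference between the predicted and actual values. For multi-class classification, as in this study, XGBoost uses a multinomial (softmax) log loss; the squared-error loss is shown here as a simple illustration: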

\[ L(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2 \]

  • The regularization term penalizes the complexity of each tree:

\[ \Omega(f_k) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \]

Here:

\[ T = \text{number of leaves in the tree} \]

\[ w_j = \text{score (weight) on leaf } j \]

\[ \gamma = \text{penalty for adding a new leaf} \]

\[ \lambda = \text{L2 regularization term on leaf weights} \]

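This objective is minimized additively: at each boosting iteration, a new tree is fit using the gradient (and, in XGBoost, the second derivative) of the loss with respect to the current predictions, which amounts to gradient descent in function space. The regularization components play a key role in controlling tree complexity, helping prevent overfitting, especially in high-dimensional or noisy datasets.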
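To make the role of \(\gamma\) and \(\lambda\) concrete, the following sketch restates the standard second-order derivation from Chen and Guestrin (2016). The symbols \(g_i\), \(h_i\), \(G_j\), \(H_j\), and \(I_j\) are introduced here for illustration and are not used elsewhere in this report. When tree \(f_t\) is added at iteration \(t\), the loss is approximated around the current prediction \(\hat{y}_i^{(t-1)}\), dropping terms that do not depend on \(f_t\):

\[ \text{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \]

where \(g_i\) and \(h_i\) are the first and second derivatives of \(L(y_i, \hat{y}_i^{(t-1)})\) with respect to \(\hat{y}_i^{(t-1)}\). Writing \(I_j\) for the set of observations that fall in leaf \(j\), with \(G_j = \sum_{i \in I_j} g_i\) and \(H_j = \sum_{i \in I_j} h_i\), the optimal leaf weight and the resulting objective value are:

\[ w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad \text{Obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \]

Read this way, a larger \(\lambda\) shrinks every leaf weight toward zero, and an additional leaf is only worthwhile if it lowers the first term by more than \(\gamma\).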

Hyperparameters such as the learning rate, maximum tree depth, regularization strength, and subsampling ratio are typically tuned via cross-validation to maximize model performance. Due to its efficiency and predictive power, XGBoost is widely used in structured data problems (Zhang, Jia, and Shang 2022).

Figure 1 (Zhang, Jia, and Shang 2022) shows the sequential three-stage process of XGBoost:

Stage 1: Initial Prediction. The process begins by making a simple first guess, or “base prediction”. This initial model is usually very simple (like the average of all the data). The graph shows this simple model as a flat line that doesn’t fit the data points well. The model then calculates the errors, or residuals, which are the differences between this first guess and the actual data points.

Stage 2: Iterative Boosting (Tree k). This is the core loop of the algorithm. A new tree is built specifically to predict the errors from the previous stage. The goal of this new tree is to correct the mistakes its predecessor made. It does this by optimizing the objective function, which balances minimizing the new errors against a penalty for being too complex. This keeps the trees simple and prevents overfitting. The graph shows the loss being reduced as the model learns.

Stage 3: Final Ensemble Prediction. The final prediction is not just the last tree. Instead, it is the sum of all the individual trees added together (the initial guess + the first correction tree + the second correction tree, and so on). By combining all these simple models, each one correcting the last, the “Final Model” becomes a single, highly accurate predictor that fits the complex patterns in the data much better than any single tree could.

So we can summarize that the XGBoost model starts with a simple guess, then builds a series of new models to fix the errors of the previous ones, and finally, adds all these models up to get a single, strong prediction (Zhang, Jia, and Shang 2022).
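The three stages can be traced in a few lines of base R. This is a minimal sketch, not the xgboost implementation: it assumes a toy one-dimensional regression problem with squared-error loss, uses depth-1 trees (stumps), and the names `fit_stump`, `predict_stump`, `eta`, and `lambda` are illustrative choices introduced here.

```r
# Toy regression data (made up for illustration only)
set.seed(1)
x <- seq(0, 1, length.out = 50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.1)

# Fit a stump to the residuals r: try every midpoint split on the sorted x
# and keep the one with the lowest squared error. Leaf values are
# L2-shrunk means, sum(r) / (n_leaf + lambda).
fit_stump <- function(x, r, lambda = 1) {
  best <- NULL
  best_sse <- Inf
  for (s in (x[-1] + x[-length(x)]) / 2) {
    left <- x <= s
    w_left  <- sum(r[left])  / (sum(left)  + lambda)
    w_right <- sum(r[!left]) / (sum(!left) + lambda)
    sse <- sum((r - ifelse(left, w_left, w_right))^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(split = s, w_left = w_left, w_right = w_right)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$w_left, stump$w_right)
}

eta <- 0.3       # learning rate (shrinks each tree's contribution)
n_trees <- 50

# Stage 1: base prediction is the mean of the target
pred <- rep(mean(y), length(y))
mse <- numeric(n_trees)

# Stage 2: each new stump is fit to the current residuals
for (k in seq_len(n_trees)) {
  resid <- y - pred
  stump <- fit_stump(x, resid)
  # Stage 3: the ensemble prediction is the running sum of all trees
  pred <- pred + eta * predict_stump(stump, x)
  mse[k] <- mean((y - pred)^2)
}

mse[c(1, n_trees)]  # training error after the first and last tree
```

The training error never increases from one iteration to the next in this sketch, because each shrunken leaf value moves the predictions part of the way toward the mean residual in that leaf.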

Analysis and Results

Code
# Load libraries
library(knitr)
library(tidyverse)
library(xgboost)
library(ggplot2)
library(DataExplorer)
library(dplyr)
library(tidyr)
library(kableExtra)
library(vcd)
library(reshape2)
library(FSelectorRcpp)
library(magrittr) 
library(tidymodels)
library(iml)
library(vip)
library(shapviz)
library(recipes)
library(purrr)
library(rsample)
library(yardstick)
library(gt)

Data Exploration and Visualization

The dataset used for this analysis was sourced from Kaggle (https://www.kaggle.com/datasets/cid007/mental-disorder-classification) and originates from a private psychologist’s office, capturing patient data collected throughout 2023. It includes information on 120 individuals who were assessed for mental health conditions. Each patient was diagnosed with one of four possible mental health statuses: normal, bipolar type I, bipolar type II, or depression. The dataset contains 17 variables that were used to inform the diagnostic process, potentially covering a range of clinical, behavioral, and demographic factors. These variables form the basis for classification and analysis of the mental health conditions represented. Table 1 provides a detailed overview of each variable included in the dataset, offering insight into the features considered during the diagnostic evaluation.

Code
# Create Table 1: Variable Names and Descriptions

# Define variables and descriptions
variables <- c(
  "Patient Number",
  "Sadness",
  "Euphoric",
  "Exhausted",
  "Sleep Disorder",
  "Mood Swing",
  "Suicidal Thoughts",
  "Anorexia",
  "Authority Respect",
  "Try-Explanation",
  "Aggressive Response",
  "Ignore & Move-On",
  "Nervous Breakdown",
  "Admit Mistakes",
  "Overthinking",
  "Sexual Activity",
  "Concentration",
  "Optimism",
  "Expert Diagnose"
)

descriptions <- c(
  "Unique identifier for each patient.",
  "How often the patient experiences sadness determined by 4 groupings (Seldom, Sometimes, Usually, Most-Often).",
  "How often the patient experiences a euphoric mood determined by 4 groupings (Seldom, Sometimes, Usually, Most-Often).",
  "How often the patient experiences feelings of physical or mental fatigue determined by 4 groupings (Seldom, Sometimes, Usually, Most-Often).",
  "How often the patient experiences presence and severity of sleep-related disturbances determined by 4 groupings (Seldom, Sometimes, Usually, Most-Often).",
  "If the patient experiences emotional mood swings (Yes/No) response.",
  "If the patient has suicidal ideation (Yes/No) response.",
  "If the patient has indicators of disordered eating or significant weight loss (Yes/No) response.",
  "If the patient respects authority and rules (Yes/No) response.",
  "If the patient tries to explain their behavior or symptoms (Yes/No) response.",
  "If the patient has a tendency to respond to situations with aggression (Yes/No) response.",
  "If the patient has a tendency to dismiss issues and move forward without resolution (Yes/No) response.",
  "If the patient has incidences of emotional or psychological breakdowns (Yes/No) response.",
  "If the patient has a willingness to admit personal faults or mistakes (Yes/No) response.",
  "If the patient has a tendency to ruminate or overanalyze situations (Yes/No) response.",
  "Level of sexual interest or activity, ranked from 1 to 10.",
  "Ability to focus and maintain attention, ranked from 1 to 10.",
  "General outlook on life and future events ranked from 1 to 10.",
  "Final clinical diagnosis assigned by the expert (Normal, Bipolar I, Bipolar II, Depression)."
)

# Create data frame
table1 <- data.frame(
  Variable = variables,
  Description = descriptions,
  stringsAsFactors = FALSE
)

# Print Table 1

kable(table1, format = "html", table.attr = "style='width:100%;'")  # for HTML output
| Variable | Description |
|---|---|
| Patient Number | Unique identifier for each patient. |
| Sadness | How often the patient experiences sadness determined by 4 groupings (Seldom, Sometimes, Usually, Most-Often). |
| Euphoric | How often the patient experiences a euphoric mood determined by 4 groupings (Seldom, Sometimes, Usually, Most-Often). |
| Exhausted | How often the patient experiences feelings of physical or mental fatigue determined by 4 groupings (Seldom, Sometimes, Usually, Most-Often). |
| Sleep Disorder | How often the patient experiences presence and severity of sleep-related disturbances determined by 4 groupings (Seldom, Sometimes, Usually, Most-Often). |
| Mood Swing | If the patient experiences emotional mood swings (Yes/No) response. |
| Suicidal Thoughts | If the patient has suicidal ideation (Yes/No) response. |
| Anorexia | If the patient has indicators of disordered eating or significant weight loss (Yes/No) response. |
| Authority Respect | If the patient respects authority and rules (Yes/No) response. |
| Try-Explanation | If the patient tries to explain their behavior or symptoms (Yes/No) response. |
| Aggressive Response | If the patient has a tendency to respond to situations with aggression (Yes/No) response. |
| Ignore & Move-On | If the patient has a tendency to dismiss issues and move forward without resolution (Yes/No) response. |
| Nervous Breakdown | If the patient has incidences of emotional or psychological breakdowns (Yes/No) response. |
| Admit Mistakes | If the patient has a willingness to admit personal faults or mistakes (Yes/No) response. |
| Overthinking | If the patient has a tendency to ruminate or overanalyze situations (Yes/No) response. |
| Sexual Activity | Level of sexual interest or activity, ranked from 1 to 10. |
| Concentration | Ability to focus and maintain attention, ranked from 1 to 10. |
| Optimism | General outlook on life and future events ranked from 1 to 10. |
| Expert Diagnose | Final clinical diagnosis assigned by the expert (Normal, Bipolar I, Bipolar II, Depression). |

Table 1. Variable names and descriptions

Initial Data Exploration
Code
# Define variables to combine
included_vars <- c("Sexual.Activity", "Concentration", "Optimism")

# Pivot data to longer format for these variables
long_data <- data %>%
  select(all_of(included_vars)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Response")

# Plot grouped bar chart
ggplot(long_data, aes(x = Response, fill = Variable)) +
  geom_bar(position = "dodge") +  # dodge = side-by-side bars
  labs(
    title = "Figure 2 Response Distribution for Sexual Activity, Concentration, and Optimism",
    x = "Response",
    y = "Count",
    fill = "Variable"
  ) +
  theme_minimal(base_size = 8) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

In Figure 2, responses appear approximately normally distributed for all three variables (concentration, optimism, and sexual activity), each of which was rated from 1 to 10 by the 120 patients.

Table 2. Category distributions per categorical variable

| Variable | Category Distribution |
|---|---|
| Admit.Mistakes | NO: 61, YES: 59 |
| Aggressive.Response | NO: 62, YES: 58 |
| Anorexia | NO: 74, YES: 46 |
| Authority.Respect | NO: 73, YES: 47 |
| Euphoric | Seldom: 46, Sometimes: 45, Usually: 20, Most-Often: 9 |
| Exhausted | Sometimes: 38, Usually: 34, Most-Often: 30, Seldom: 18 |
| Expert.Diagnose | Bipolar Type-2: 31, Depression: 31, Normal: 30, Bipolar Type-1: 28 |
| Ignore...Move.On | NO: 70, YES: 50 |
| Mood.Swing | NO: 63, YES: 57 |
| Nervous.Breakdown | YES: 62, NO: 58 |
| Overthinking | YES: 65, NO: 55 |
| Sadness | Sometimes: 42, Usually: 42, Most-Often: 20, Seldom: 16 |
| Sleep.Disorder | Sometimes: 44, Usually: 34, Most-Often: 21, Seldom: 21 |
| Suicidal.Thoughts | NO: 63, YES: 57 |
| Try.Explanation | NO: 63, YES: 57 |

Table 2 presents the category distribution of each categorical variable. The yes/no variables are approximately balanced across patient responses, although Anorexia, Authority Respect, and Ignore & Move-On lean toward NO. The target variable, Expert Diagnose, is similarly balanced, with 28 to 31 patients per diagnostic category. This distribution ensures that no single diagnosis dominates the dataset, making it suitable for robust analysis.

Code
# Convert to data frame for plotting

cramers_df <- data.frame(
  Variable = names(cramers_v),
  CramersV = cramers_v
)

# Drop NA values
cramers_df <- cramers_df[!is.na(cramers_df$CramersV), ]

# Plot
ggplot(cramers_df, aes(x = reorder(Variable, CramersV), y = CramersV)) +
  geom_col(fill = "#2c7fb8") +
  coord_flip() +
  labs(title = "Figure 3 Cramer's V Association with Mental Health Diagnosis",
       x = "Feature",
       y = "Cramer's V") +
  theme_minimal()

Figure 3 shows the association between categorical variables and the mental health diagnosis using Cramer’s V. The results indicate that Mood Swings and Suicidal Thoughts exhibit the strongest association with the diagnosis.

Cramer’s V is calculated as:

\[ V = \sqrt{\frac{\chi^2 / n}{\min(k - 1, r - 1)}} \]

where \(\chi^2\) is the chi-squared statistic for the contingency table, \(n\) is the total number of observations, \(k\) is the number of columns, and \(r\) is the number of rows in the table (Wu, Zhang, and Zhao 2014).
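The formula above can be checked by hand in base R. The sketch below is an illustration only: the function name `cramers_v_stat` and the toy vectors are assumptions introduced here, not the code that produced Figure 3.

```r
# Cramer's V for two categorical vectors, following the formula above
cramers_v_stat <- function(x, y) {
  tab <- table(x, y)                       # r x k contingency table
  # Pearson chi-squared without continuity correction
  chi2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE)$statistic))
  n <- sum(tab)
  sqrt((chi2 / n) / min(ncol(tab) - 1, nrow(tab) - 1))
}

# Toy example (made-up data): a symptom that perfectly separates two groups
symptom   <- rep(c("YES", "NO"), each = 10)
diagnosis <- rep(c("A", "B"), each = 10)
cramers_v_stat(symptom, diagnosis)

# Toy example: a symptom that is unrelated to the groups
unrelated <- rep(c("YES", "NO"), times = 10)
cramers_v_stat(unrelated, diagnosis)
```

V ranges from 0 (no association) to 1 (perfect association), which is what the two toy cases are meant to show.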

Data Transformation & Cleaning

To facilitate analysis, all categorical variables in the dataset were transformed into numerical or grouped categorical formats in Excel. The transformations were applied as follows:

  • Ordinal Frequency Responses (e.g., Sadness, Euphoric, Exhausted):

    • Seldom → 1
    • Sometimes → 2
    • Usually → 3
    • Most-Often → 4
  • Binary Responses (e.g., Suicidal Thoughts, Mood Swing):

    • Yes → 1
    • No → 0
  • Scaled Ratings (1–10) (e.g., Sexual Activity, Optimism): These were grouped into three categorical levels to simplify analysis:

    • Ratings 1–3 → Category 1
    • Ratings 4–6 → Category 2
    • Ratings 7–9 → Category 3

These transformations standardized the dataset, allowing for easier correlation analysis, visualizations, and model input compatibility.
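The recoding was done in Excel, but the same mapping can be written in R for reproducibility. The following is a hedged sketch on a small made-up data frame; the column values and the object names `raw`, `freq_levels`, and `coded` are assumptions for illustration.

```r
# Hypothetical raw responses using the category labels from Table 1
raw <- data.frame(
  Sadness    = c("Seldom", "Usually", "Most-Often", "Sometimes"),
  Mood.Swing = c("YES", "NO", "NO", "YES"),
  Optimism   = c(2, 5, 8, 6),
  stringsAsFactors = FALSE
)

# Ordered frequency labels: position in this vector is the numeric code
freq_levels <- c("Seldom", "Sometimes", "Usually", "Most-Often")

coded <- data.frame(
  # Seldom -> 1, Sometimes -> 2, Usually -> 3, Most-Often -> 4
  Sadness = match(raw$Sadness, freq_levels),
  # Yes -> 1, No -> 0
  Mood.Swing = as.integer(toupper(raw$Mood.Swing) == "YES"),
  # Ratings 1-3 -> 1, 4-6 -> 2, 7-9 -> 3 (values outside 1-9 become NA)
  Optimism = cut(raw$Optimism, breaks = c(0, 3, 6, 9), labels = FALSE)
)

coded
```

Using `match()` against an explicit label vector keeps the ordinal order under the analyst's control rather than relying on alphabetical factor levels.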

Code
# Plot using ggplot2
ggplot(info_gain_sorted, aes(x = reorder(attributes, importance), y = importance)) +
  geom_bar(stat = "identity", fill = "forestgreen") +
  coord_flip() +
  labs(title = "Figure 4 Information Gain by Variable",
       x = "Variable",
       y = "Information Gain") +
  theme_minimal()

Figure 4 illustrates the results of an information gain analysis performed on the transformed dataset. Among the variables, Mood Swings (0.5759) and Optimism (0.2303) exhibit the highest information gain, indicating that they contribute most significantly to reducing uncertainty in predicting the target variable.

Information gain is calculated as:

\[ IG(T, X) = H(T) - \sum_{v \in \text{Values}(X)} \frac{|T_v|}{|T|} H(T_v) \]

where \(H(T)\) is the entropy of the target variable \(T\), \(X\) is the feature of interest, \(T_v\) is the subset of \(T\) where \(X\) takes value \(v\), and \(|\cdot|\) denotes the number of observations (Lee, Yang, and Lee 2022).
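As a worked illustration of the formula, the sketch below computes entropy and information gain in base R on made-up vectors. The helper names `entropy` and `info_gain` are introduced here, and the use of base-2 logarithms is an assumption; a different log base only rescales the values and does not change the ranking of variables.

```r
# Shannon entropy of a categorical vector (in bits, assuming log base 2)
entropy <- function(t) {
  p <- table(t) / length(t)
  p <- p[p > 0]                 # drop empty categories to avoid log(0)
  -sum(p * log2(p))
}

# Information gain of `feature` about `target`:
# H(T) minus the weighted entropy of T within each feature value
info_gain <- function(target, feature) {
  weights <- table(feature) / length(feature)       # |T_v| / |T|
  cond <- sapply(split(target, feature), entropy)   # H(T_v)
  entropy(target) - sum(weights[names(cond)] * cond)
}

# Toy data (made up): one feature determines the target, one does not
target        <- c("a", "a", "b", "b")
informative   <- c("x", "x", "y", "y")
uninformative <- c("x", "y", "x", "y")

info_gain(target, informative)
info_gain(target, uninformative)
```

A feature that fully determines the target recovers all of \(H(T)\), while a feature that leaves the class mix unchanged in every subset has zero gain.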

Modeling and Results

Overview of XGBoost Modeling Framework

The objective of this analysis is to develop a predictive model for mental health classification using a gradient boosting framework. Extreme Gradient Boosting (XGBoost) was selected for its computational efficiency, strong performance on small structured datasets, and ability to provide interpretable feature insights.

The dataset comprises 120 observations and 17 features, making XGBoost an appropriate choice given its efficiency with limited data. Its tree-based ensemble approach effectively captures nonlinear relationships and complex patterns without requiring the large sample sizes typically needed for deep learning models.

Moreover, XGBoost’s feature importance outputs enhance interpretability, an essential consideration in mental health research, where understanding variable contributions is as valuable as predictive accuracy. Finally, the algorithm’s optimized gradient boosting process allows for rapid experimentation with hyperparameters and validation folds, facilitating efficient model tuning and evaluation.

Hyperparameter Optimization
Code
#Preparing Transformed Dataset
datat <- datat %>%
  mutate(across(where(is.character), as.factor))
Code
# Make sure target factor levels are consistent
datat$Expert.Diagnose <- factor(datat$Expert.Diagnose)
Code
# Split into training and testing
set.seed(123)
data_split <- initial_split(datat, prop = 0.8, strata = Expert.Diagnose)
train_data <- training(data_split)
test_data  <- testing(data_split)

The dataset was randomly partitioned into an 80/20 training–testing split, stratified by diagnosis so that each class is represented in similar proportions in both sets. This resulted in 96 patient surveys used for model training and 24 surveys reserved for independent testing. The predictive target for all models was the expert clinical diagnosis, which served as the ground-truth reference label.
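To show what stratification does, here is a base R sketch that samples 80% of the rows within each diagnosis, using the class counts from Table 2. It is an approximation for illustration; rsample's `initial_split()` may assign rows differently, and the object names are assumptions.

```r
set.seed(123)

# Diagnosis labels with the class counts reported in Table 2
diagnosis <- rep(
  c("Bipolar Type-1", "Bipolar Type-2", "Depression", "Normal"),
  times = c(28, 31, 31, 30)
)

# Within each class, draw 80% of that class's row indices for training
rows_by_class <- split(seq_along(diagnosis), diagnosis)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

# Everything not drawn for training is held out for testing
test_idx <- setdiff(seq_along(diagnosis), train_idx)

table(diagnosis[train_idx])  # per-class counts in the training set
length(train_idx)
length(test_idx)
```

Sampling within each class keeps the four diagnoses in roughly the same proportions in both sets, which matters with only about 30 patients per class.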

Code
# Begin XG Boost
xgb_recipe <- recipe(Expert.Diagnose ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%  # Convert factors to dummies
  step_zv(all_predictors())                 # Remove zero-variance predictors
Code
# Define XGBoost model with tunable parameters
xgb_spec <- boost_tree(
  trees = tune(),
  tree_depth = tune(),
  learn_rate = tune(),
  loss_reduction = tune(),
  min_n = tune(),
  sample_size = tune(),
  mtry = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

An XGBoost model specification was created using the tidymodels framework with all major hyperparameters set to be tuned, including the number of trees, tree depth, learning rate, loss reduction, minimum node size, subsampling rate, and number of predictors considered at each split. The model was configured to use the “xgboost” engine and trained in classification mode. This specification functions as the template for subsequent hyperparameter optimization within the workflow.

Code
#Creating Workflow
xgb_workflow <- workflow() %>%
  add_model(xgb_spec) %>%
  add_recipe(xgb_recipe)
Code
# Resampling method
set.seed(123)
folds <- vfold_cv(train_data, v = 5, strata = Expert.Diagnose)

Five-fold stratified cross-validation was used for hyperparameter tuning. The training dataset was partitioned into five equally sized folds with the class distribution of the outcome variable (Expert Diagnosis) preserved in each fold. For each tuning iteration, models were trained on four folds and validated on the remaining fold, ensuring stable and unbiased performance estimates. A fixed random seed was used to ensure reproducibility of the fold assignments.
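The fold assignment can be pictured with a short base R sketch. The per-class training counts below are assumed for illustration (they sum to 96), and the procedure is a simplified stand-in for what `vfold_cv()` does with `strata`.

```r
set.seed(123)

# Hypothetical training labels: 96 rows across four diagnoses (counts assumed)
labels <- rep(
  c("Bipolar Type-1", "Bipolar Type-2", "Depression", "Normal"),
  times = c(22, 25, 25, 24)
)

v <- 5
fold_id <- integer(length(labels))

for (cls in unique(labels)) {
  idx <- which(labels == cls)
  # Give this class an even supply of fold numbers 1..v, then shuffle them
  fold_id[idx] <- sample(rep_len(seq_len(v), length(idx)))
}

# Rows = diagnosis, columns = fold: each class is spread evenly over folds
fold_table <- table(labels, fold_id)
fold_table
```

In each tuning iteration, the rows with one fold number are held out for validation and the remaining four folds are used for training.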

Code
# Hyperparameter grid
xgb_grid <- grid_space_filling(
  trees(range = c(100, 1000)),
  tree_depth(range = c(2, 10)),
  learn_rate(range = c(0.001, 0.3)),  # note: dials reads this range in log10 units
  loss_reduction(),
  min_n(),
  sample_size = sample_prop(),
  finalize(mtry(), train_data),
  size = 30
)

The hyperparameter grid for the XGBoost model was constructed using a space-filling design to efficiently explore the multi-dimensional parameter space. This approach ensures broad coverage of plausible hyperparameter combinations while avoiding the combinatorial explosion of a full grid search. Specifically, the grid includes the following parameters:

  • Number of Trees (trees):
    Set between 100 and 1000 to cover both small ensembles and larger ensembles capable of modeling complex relationships. This range balances expressiveness with computational efficiency.

  • Tree Depth (tree_depth):
    Ranges from 2 to 10, allowing the exploration of simple trees that generalize well as well as deeper trees that can capture intricate feature interactions.

  • Learning Rate (learn_rate):
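    Specified in the grid as range = c(0.001, 0.3). Lower values provide finer control over updates, potentially improving generalization, while higher values allow faster convergence at the risk of overshooting the optimal solution. Because dials interprets the learn_rate() range on the log10 scale by default, this specification samples learning rates between roughly 1.0 and 2.0, which is consistent with the tuned value of 1.6501 in Table 3; a range of c(-3, log10(0.3)) would correspond to learning rates from 0.001 to 0.3.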

  • Loss Reduction (loss_reduction):
    Controls the minimum reduction in loss required to create a split, helping to prevent unnecessary splits and reduce overfitting.

  • Minimum Node Size (min_n):
    Specifies the minimum number of observations per terminal node, balancing the ability to capture fine-grained patterns with regularization.

  • Subsampling (sample_size):
    Represented as a proportion of the training data, introducing stochasticity to improve generalization.

  • Number of Predictors Considered per Split (mtry):
    Finalized based on the dimensions of the training dataset to ensure meaningful splits.

The final grid consists of 30 candidate hyperparameter combinations, which provides a compromise between computational feasibility and adequate coverage of the parameter space.
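A quick arithmetic check on the learning-rate range may be useful here. It assumes, as in the dials defaults for `learn_rate()`, that the range is supplied in log10 units; the variable name `requested` is introduced for illustration.

```r
# Range as written in the grid specification
requested <- c(0.001, 0.3)

# If those numbers are read as log10 units, the learning rates sampled span:
sampled_bounds <- 10^requested
sampled_bounds

# Log10-scale bounds that correspond to learning rates of 0.001 and 0.3:
intended_bounds <- log10(requested)
intended_bounds
```

Under that assumption, the tuned learning rate in Table 3 falls inside the first interval, and the second pair of numbers is what the range argument would need to be for learning rates between 0.001 and 0.3.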

Using a space-filling design (grid_space_filling) offers several advantages:

  1. Efficiency: Broadly explores the parameter space with fewer evaluations.
  2. Coverage: Reduces the risk of missing promising hyperparameter configurations.
  3. Scalability: Remains computationally feasible for large datasets or models with many parameters.

In summary, this hyperparameter grid balances model flexibility, generalization, and computational efficiency, allowing XGBoost to explore a wide range of model complexities while incorporating key regularization strategies to reduce overfitting (Géron 2019).

Code
# Tuning the model
set.seed(123)
xgb_tune_results <- tune_grid(
  xgb_workflow,
  resamples = folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE),
  metrics = metric_set(accuracy, roc_auc)
)

Hyperparameter tuning was performed using 5-fold stratified cross-validation with a space-filling grid of 30 candidate parameter combinations. For each combination, models were trained on four folds and evaluated on the fifth, cycling through all folds. Model performance was assessed using accuracy and the area under the ROC curve (ROC AUC). Predictions from each resample were saved to support additional diagnostic analyses. A fixed random seed ensured reproducibility of all results.

Code
# Visualize tuning results
autoplot(xgb_tune_results)

Figure 5. Accuracy and ROC AUC across hyperparameter values for 5-fold stratified cross-validation

Code
tune_df <- collect_metrics(xgb_tune_results)

# Plot mean performance vs number of trees, one panel per metric
ggplot(tune_df, aes(x = trees, y = mean, color = factor(tree_depth))) +
  geom_point() +
  geom_line() +
  facet_wrap(~ .metric, scales = "free_y") +
  labs(
    x = "Number of Trees",
    y = "Mean Performance",
    color = "Tree Depth"
  ) +
  theme_minimal()

Figure 6. ROC-AUC versus Number of trees plot

Visualizing the tuning results revealed clear performance patterns across the explored hyperparameter space. Figure 6 indicated that the optimal tree depth was approximately 2, suggesting that relatively shallow trees generalized best for this dataset. Model performance increased steadily with the number of boosting iterations, with the highest ROC AUC occurring at around 625 trees. At this configuration, the model achieved a cross-validated accuracy exceeding 0.80 and a ROC AUC greater than 0.90, indicating strong discriminatory performance.

Code
# Selecting best parameters
best_params <- select_best(xgb_tune_results, metric = "roc_auc")
Code
best_params_table <- best_params %>%
  as_tibble() %>%                # ensure it's a tibble
  mutate(across(where(is.numeric), ~ round(.x, 4)))  # round numeric values


best_params_table %>%
  gt() %>%
  tab_header(
    title = "Best Hyperparameter Combination",
    subtitle = "Selected based on ROC AUC"
  ) %>%
  fmt_number(
    columns = where(is.numeric),
    decimals = 4
  ) %>%
  cols_label(
    trees = "Number of Trees",
    tree_depth = "Tree Depth",
    learn_rate = "Learning Rate",
    loss_reduction = "Loss Reduction",
    min_n = "Min Node Size",
    sample_size = "Subsample Size",
    mtry = "Number of Predictors"
  )
Best Hyperparameter Combination
Selected based on ROC AUC

| Number of Predictors | Number of Trees | Min Node Size | Tree Depth | Learning Rate | Loss Reduction | Subsample Size | .config |
|---|---|---|---|---|---|---|---|
| 3.0000 | 627.0000 | 2.0000 | 2.0000 | 1.6501 | 0.0034 | 0.5655 | pre0_mod06_post0 |

Table 3. Best hyperparameter combination based on ROC AUC

Although visual inspection of the tuning plots provided a general sense of the model’s performance trends, identifying the exact optimal hyperparameters was not straightforward due to interaction effects between parameters and the non-linear nature of boosted tree models. Therefore, the select_best() function from the tidymodels framework was used to systematically extract the hyperparameter combination that achieved the highest mean ROC AUC across the cross-validation resamples, as shown in Table 3.

The resulting optimal configuration included 3 predictors (mtry = 3), 627 boosting iterations, a minimum node size (min_n) of 2, a tree depth of 2, a learning rate of 0.16501, a loss-reduction parameter of 0.0034, and a subsample proportion of 0.5655. This combination corresponds to the tuning grid entry labeled pre0_mod06_post0, which achieved the best balance between model complexity and predictive performance during cross-validation.

This automated selection approach ensures that the final model parameters are chosen objectively based on empirical performance rather than subjective visual interpretation, which can be challenging when multiple hyperparameters interact.
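The selection rule can be illustrated with a few lines of base R. The sketch below assumes that selection reduces to averaging the chosen metric over the folds for each candidate and keeping the candidate with the highest mean. The configuration labels and per-fold values are hypothetical and are not results from this study.

```r
# Hypothetical per-fold ROC AUC values for three candidate configurations
fold_metrics <- data.frame(
  config  = rep(c("mod01", "mod02", "mod03"), each = 5),
  roc_auc = c(0.90, 0.92, 0.88, 0.91, 0.89,   # mod01
              0.95, 0.93, 0.96, 0.94, 0.97,   # mod02
              0.85, 0.99, 0.80, 0.90, 0.86)   # mod03
)

# Average each configuration over the folds, then keep the highest mean
mean_auc    <- tapply(fold_metrics$roc_auc, fold_metrics$config, mean)
best_config <- names(which.max(mean_auc))

round(mean_auc, 3)
best_config
```

In this toy example mod03 has the best single fold (0.99) but the lowest mean, which is why the choice is based on the average over resamples and not on any one fold.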

Code
# Finalize the model with best params
final_xgb <- finalize_workflow(xgb_workflow, best_params)
Code
# Fit the finalized workflow on the full training data
final_fit <- fit(final_xgb, data = train_data)

# Compute and plot feature importance
vip(extract_fit_parsnip(final_fit), geom = "col")

Figure 7. Top 5 features using the best parameter fit

With the best hyperparameters, the five most important features, in descending order of importance, are mood swing, sadness, aggressive response, exhausted, and euphoric, as shown in Figure 7.

Code
# Prediction visualization: predicted class alongside the expert diagnosis
preds2 <- predict(final_fit, new_data = test_data, type = "class") %>%
  bind_cols(test_data %>% select(Expert.Diagnose))

# Plot
ggplot(preds2, aes(x = factor(.pred_class), fill = factor(Expert.Diagnose))) +
  geom_bar(position = "dodge") +
  labs(x = "Predicted Class", fill = "True Class") +
  theme_minimal()

Figure 8. Model’s Predictions on the test set

In Figure 8, the model’s predictions on the held-out test set agree with the expert diagnoses in 19 of 26 cases (73.1%). Specifically:

  • Bipolar Type-1: 4 of 6 cases were correctly predicted (66.7%)
  • Bipolar Type-2: 5 of 7 cases were correctly predicted (71.4%)
  • Depression: 6 of 7 cases were correctly predicted (85.7%)
  • Normal: 4 of 6 cases were correctly predicted (66.7%)

Overall, these results indicate that the model accurately identified the majority of cases in each diagnostic category, demonstrating good discriminatory performance across the four classes.
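The per-class counts listed above can be combined into overall figures with a short calculation. The sketch assumes that each denominator is the number of test cases with that expert diagnosis, so that each ratio is a per-class recall.

```r
# Correct predictions and class totals, taken from the list above
correct <- c("Bipolar Type-1" = 4, "Bipolar Type-2" = 5, "Depression" = 6, "Normal" = 4)
total   <- c("Bipolar Type-1" = 6, "Bipolar Type-2" = 7, "Depression" = 7, "Normal" = 6)

per_class_recall <- correct / total
overall_accuracy <- sum(correct) / sum(total)   # 19 / 26
macro_recall     <- mean(per_class_recall)      # unweighted mean over classes

round(per_class_recall, 3)
round(overall_accuracy, 3)  # 0.731
round(macro_recall, 3)      # 0.726
```

Overall accuracy weights every test case equally, while the macro average weights every class equally, so the two differ slightly when class sizes differ.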

Model Evaluation & Performance
Code
# Fitting final model on all training data
final_fit <- fit(final_xgb, data = train_data)
Code
# Evaluate the final model on test data
test_results <- predict(final_fit, test_data) %>%
  bind_cols(predict(final_fit, test_data, type = "prob")) %>%
  bind_cols(test_data %>% select(Expert.Diagnose))
Code
# Confusion Matrix
conf_mat(test_results, truth = Expert.Diagnose, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Figure 9. Confusion Matrix

Code
# Base metrics
base_tbl <- metrics(
  test_results,
  truth = Expert.Diagnose,
  estimate = .pred_class
)

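# Probability columns (.pred_<class>), excluding the hard class prediction
prob_cols <- setdiff(
  grep("^\\.pred_", names(test_results), value = TRUE),
  ".pred_class"
)

# Add macro-weighted ROC AUC
roc_tbl <- roc_auc(
  test_results,
  truth = Expert.Diagnose,
  all_of(prob_cols),
  estimator = "macro_weighted"
) %>%
  mutate(.metric = "roc_auc_macro_weighted")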

# Combine & format wide
metrics_wide <- bind_rows(base_tbl, roc_tbl) %>%
  select(.metric, .estimate) %>%
  pivot_wider(
    names_from = .metric,
    values_from = .estimate
  ) %>%
  mutate(Class = "overall") %>%
  select(Class, everything()) %>%
  mutate(across(-Class, ~ round(.x, 3)))

# Output Table
kable(
  metrics_wide,
  format = "markdown",
  caption = "Overall Classification Metrics",
  digits = 3
)
Overall Classification Metrics

| Class | accuracy | kap | roc_auc_macro_weighted |
|---|---|---|---|
| overall | 0.731 | 0.639 | 0.935 |

Table 4. Overall Classification Metrics

Code
# Compute macro-averaged precision, recall, F1 (yardstick's default multiclass estimator)
prf_table <- list(
  precision = precision(test_results, truth = Expert.Diagnose, estimate = .pred_class),
  recall    = recall(test_results, truth = Expert.Diagnose, estimate = .pred_class),
  f1        = f_meas(test_results, truth = Expert.Diagnose, estimate = .pred_class)
) %>%
  bind_rows() %>%
  select(Class = .estimator, Metric = .metric, Estimate = .estimate) %>%
  mutate(Estimate = round(Estimate, 3)) %>%
  pivot_wider(
    names_from = Metric,
    values_from = Estimate
  )

# Produce table
gt_prf <- prf_table %>%
  gt() %>%
  tab_header(
    title = "Classification Performance by Class",
    subtitle = "Precision, Recall, and F1-Score for Final Model"
  ) %>%
  cols_label(
    Class = "Class",
    precision = "Precision",
    recall = "Recall",
    f_meas = "F1-Score"
  ) %>%
  tab_options(
    table.font.size = 14,
    heading.title.font.size = 16,
    heading.subtitle.font.size = 14,
    table.width = pct(80),
    data_row.padding = px(4)
  ) %>%
  fmt_number(
    columns = c(precision, recall, f_meas),
    decimals = 3
  ) %>%
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  )

gt_prf
Classification Performance by Class
Precision, Recall, and F1-Score for Final Model

| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| macro | 0.760 | 0.726 | 0.733 |

Table 5. Classification performance by class

Code
# ROC curves (one-vs-all, one curve per class)
roc_curve(test_results, truth = Expert.Diagnose, all_of(prob_cols)) %>%
  autoplot()

Figure 10. ROC curves (sensitivity vs. 1 - specificity) for each class

The XGBoost multiclass classification model demonstrates moderate overall predictive performance on the held-out test set, with several indicators of strong underlying discriminative ability, as shown in Table 4.

  • Accuracy 0.731
    The model correctly classifies approximately 73% of test observations. While this level of accuracy is reasonable given the complexity of the four-class classification task, it also suggests that certain classes remain challenging to distinguish.

  • Kappa 0.639
    A Cohen’s Kappa value of 0.639 indicates moderate-to-substantial agreement beyond chance. This reinforces the conclusion that, although the model performs meaningfully better than random guessing, its class-level consistency is not yet high.

  • Macro-Weighted ROC AUC 0.935
    The macro-weighted AUC score of 0.935 is notably strong. This indicates that the model ranks observations from different classes correctly with high reliability. The contrast between high AUC and moderate accuracy suggests that the model achieves good separability between classes but may struggle with final decision boundaries, often due to class imbalance, overlapping features, or a suboptimal decision rule.
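For reference, Cohen's Kappa compares the observed agreement with the agreement expected by chance from the row and column totals of the confusion matrix. The base R sketch below shows the calculation. The helper cohen_kappa and the 2 x 2 matrix are hypothetical illustrations, not the confusion matrix of this study.

```r
# Cohen's kappa from a confusion matrix (rows = predicted, columns = truth)
cohen_kappa <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                      # observed agreement (accuracy)
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Hypothetical 2 x 2 confusion matrix
cm <- matrix(c(20,  5,
               10, 15), nrow = 2, byrow = TRUE)

cohen_kappa(cm)  # observed agreement 0.7, chance agreement 0.5, kappa 0.4
```

Because chance agreement is subtracted, Kappa is always lower than accuracy unless the classifier is perfect, which is why a Kappa value sits below the corresponding accuracy.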

Macro Precision, Recall, and F1-Score
Class-aggregated performance metrics in Table 5 provide additional insight into class-level behavior:

  • Precision: 0.760
    Indicates that when the model predicts a given class, it is correct roughly 76% of the time, averaged across classes.
  • Recall: 0.726
    Shows the model retrieves approximately 73% of true class instances on average, reflecting missed detections in some classes.
  • F1-Score: 0.733
    The harmonic mean of precision and recall, averaged across classes, underscores balanced but moderate performance.
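Macro averaging computes each metric per class and then takes the unweighted mean over classes. The sketch below shows this in base R, assuming a confusion matrix with predictions in rows and true classes in columns. The helper macro_prf and the 3 x 3 matrix are hypothetical.

```r
# Macro-averaged precision, recall and F1 from a confusion matrix
# (rows = predicted class, columns = true class)
macro_prf <- function(cm) {
  tp        <- diag(cm)
  precision <- tp / rowSums(cm)   # of the cases predicted as class k, share that are k
  recall    <- tp / colSums(cm)   # of the true class-k cases, share that are recovered
  f1        <- 2 * precision * recall / (precision + recall)
  c(precision = mean(precision), recall = mean(recall), f1 = mean(f1))
}

# Hypothetical 3-class confusion matrix
cm <- matrix(c(8, 1, 1,
               2, 6, 0,
               0, 3, 9), nrow = 3, byrow = TRUE)

round(macro_prf(cm), 3)
```

The macro F1 here is the mean of the per-class F1 scores, so it is generally not equal to the harmonic mean of macro precision and macro recall. A class that is never predicted would produce a division by zero and needs separate handling.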

Together, these metrics indicate a model with strong discriminative potential (as reflected in the high AUC) but only moderate classification accuracy and balance (as shown by the precision, recall, and F1 scores). This performance pattern suggests several avenues for improvement, including:

  • Adjusting class weights or applying resampling if class imbalance affects learning
  • Further hyperparameter tuning over a wider or finer grid (e.g., learning rate, tree depth, regularization parameters)
  • Exploring probability threshold adjustments or calibration
  • Engineering additional or more informative features to better separate difficult classes

In summary, the XGBoost model shows promising underlying structure with strong class separability. With targeted refinements, its classification reliability can likely be improved substantially (Géron 2019).

Feature Importance & Interpretability
Code
# Extract fitted workflow and XGBoost model

wf_fit <- final_fit
xgb_fit <- wf_fit %>%
  extract_fit_engine()

# Extract recipe and bake training data into numeric matrix

rec <- wf_fit %>% extract_recipe()

train_baked <- bake(rec, new_data = train_data)

# Separate predictors and target

X_train <- train_baked %>% select(-Expert.Diagnose)
X_train_matrix <- as.matrix(X_train)
Code
# Compute SHAP values

sv <- shapviz(
  xgb_fit,
  X_train_matrix,
  X = X_train
)
Code
# Global feature importance

imp_plot <- sv_importance(sv)  # avoid reusing the function name for the result
imp_plot

Figure 11. SHAP value plot

The SHAP value plot in Figure 11 provides insight into how each predictor contributes to the classification decisions across the four classes. Several features emerge as consistently influential, though the degree of influence varies across classes, highlighting class-specific behavioral and emotional markers within the model.

A few features show high average absolute SHAP values across multiple classes, indicating broad predictive power:

  • Mood Swing
    This feature exhibits the largest SHAP magnitude overall, particularly for Bipolar Type-2 and Bipolar Type-1. Its prominence suggests that variability in mood plays a central role in distinguishing class patterns.

  • Suicidal thoughts
    This feature contributes heavily, especially for Depression and Bipolar Type-2, indicating that the presence or intensity of suicidal thoughts is highly informative for separating these groups.

  • Sadness
    Sadness emerges as a strong differentiator for Bipolar Type-1, suggesting that this emotional state carries notable weight in class identification for this group.
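The "average absolute SHAP value" used to rank features can be computed directly from a matrix of SHAP values. The base R sketch below uses a hypothetical matrix with made-up feature names and values to show the calculation.

```r
# Hypothetical SHAP matrix: rows = observations, columns = features
shap_values <- matrix(
  c( 1.5, -0.2,  0.4,
    -1.1,  0.3, -0.6,
     0.9, -0.1,  0.5,
    -1.3,  0.2, -0.3),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("feature_a", "feature_b", "feature_c"))
)

# Global importance: mean absolute SHAP value per feature, largest first
importance <- sort(colMeans(abs(shap_values)), decreasing = TRUE)
importance
```

Taking absolute values matters: the signed contributions of feature_a sum to zero in this toy matrix, so a signed mean would rank it last even though it moves individual predictions the most.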

Some predictors exhibit strong influence for specific classes, reflecting how the model uses distinct emotional, cognitive, or behavioral signals:

  • Concentration Category 2
    Particularly influential for Normal, implying that concentration-related difficulties help distinguish this class.

  • Authority & Respect
    Most impactful for Bipolar Type-2, suggesting that attitudes toward authority meaningfully guide predictions for this group.

  • Ignore & Move.On
    Shows notable influence for Normal, indicating that coping style differences contribute to distinguishing this class.

  • Exhausted
    More influential for Normal than others, highlighting fatigue-related symptoms as a distinguishing factor.

Several features provide moderate but consistent contributions across classes:

  • Euphoric
  • Nervous Breakdown
  • Aggressive Response
  • Overthinking
  • Sleep disorder

While not dominant for any single class, these variables add meaningful nuance to the model’s decision making.

Overall, the SHAP results indicate that the model relies on a multidimensional combination of emotional states (e.g., mood swings, sadness), cognitive patterns (e.g., concentration difficulty, overthinking), and behavioral responses (e.g., coping style, aggression) to differentiate between classes.

The variation in SHAP profiles across groups suggests that the model is capturing distinct psychological or behavioral signatures for each class, rather than depending on a single feature. This aligns with the behavior expected of a well-fitted multiclass XGBoost model, where class-specific patterns emerge through complex interactions among predictors.

Code
# Pick an observation

i <- 1

sv_waterfall(sv, row_index = i)

Figure 12. SHAP profile across classes

The SHAP waterfall charts in Figure 12 illustrate how individual feature values contribute to the predicted score for each class by showing how the model moves from the baseline expectation (E[f(x)]) to the final class-specific output (f(x)). These plots highlight the most influential features driving the model’s decision for each class.
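The additive structure behind a waterfall chart can be written out in a few lines: the class-specific output equals the baseline plus the sum of the feature contributions. The values and feature names below are hypothetical and are not read from Figure 12.

```r
# Hypothetical baseline and SHAP contributions for one observation and one class
baseline <- 0.50   # E[f(x)]
phi <- c(feature_a = 1.20, feature_b = -0.40, feature_c = 0.30)

# Order the contributions by magnitude and accumulate them from the baseline
ordered_phi <- phi[order(abs(phi), decreasing = TRUE)]
running     <- baseline + cumsum(ordered_phi)

running                                  # value after each step of the waterfall
fx <- unname(running[length(running)])   # final output f(x)
fx
```

The ordering only affects how the bars are displayed. The final value is the same for any ordering because the contributions are simply summed.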

For Bipolar Type-1, the model predicts a relatively high output value, with the final prediction rising substantially above the baseline (E[f(x)] = 1.01). This indicates strong support in favor of Bipolar Type-1 (Class_1) for this instance.

Key positive contributors include:

  • Mood Swing = 1 (+1.54)
  • Sadness = 1 (+1.50)
  • Admit Mistakes = 0
  • Suicidal thoughts = 1
  • Authority & Respect = 0

These features collectively push the prediction upward, suggesting that emotional variability, sadness, and the presence of suicidal thoughts are among the strongest drivers toward this class in this specific observation. Minor negative contributions appear (e.g., Sleep disorder = 3), but the dominant pattern is positive.

Overall, the waterfall shows that Bipolar Type-1 receives strong cumulative support from several major emotional and behavioral indicators.

For Bipolar Type-2, the final predicted score (f(x) = 1.11) is lower than the class baseline (E[f(x)] = 2.43), meaning the model moves away from predicting Bipolar Type-2 (Class_2) for this instance.

Major positive contributions:

  • Suicidal thoughts = 1 (+1.66)
  • Mood Swing = 1 (+1.33)

Major negative contributions:

  • Sadness = 1 (–1.24)
  • Authority & Respect = 0 (–0.91)
  • Concentration Category 2 = 1 (–0.82)
  • Try Explanation = 1 (–0.79)
  • Nervous Breakdown = 1 (–0.77)
  • Admit Mistakes = 0 (–0.70)

Although there are strong positive drivers, they are outweighed by multiple negative influences. This results in a net downward movement from the class baseline, indicating that Bipolar Type-2 is not well supported for this observation relative to Bipolar Type-1.

The Depression prediction moves from a baseline of E[f(x)] = 1.11 to a final prediction of f(x) = –0.298, a substantial downward shift indicating strong evidence against Depression (Class_3) for this instance.

Key contributions include:

Positive:

  • Suicidal thoughts = 1 (+1.94)

Negative:

  • Sadness = 1 (–0.96)
  • Euphoric = 2 (–0.95)
  • Concentration Category 2 = 1 (–0.90)
  • Mood Swing = 1 (–0.90)
  • Optimism Category 2 = 0 (–0.81)
  • Anorexia = 1 (–0.71)
  • Admit Mistakes = 0 (–0.64)

The single strong positive contribution is overwhelmed by numerous moderately sized negatives. This creates a consistent downward trajectory, showing that the feature profile does not align well with the Depression pattern.

The prediction for Normal moves from a baseline of E[f(x)] = 0.174 up to f(x) = 1.37, indicating support for Normal (Class_4), though not as strongly as for Bipolar Type-1.

Key contributions:

Positive:

  • Ignore & Move.On = 0
  • Try Explanation = 1
  • Optimism Category 2 = 0 (+0.56)
  • Sadness = 1 (+0.51)

Negative:

  • Mood Swing = 1 (–1.43)
  • Euphoric = 2 (–0.77)
  • Sleep disorder = 3 (–0.49)

Despite some sizeable negative pushes, particularly from Mood Swing, the positive contributions collectively outweigh them, resulting in a moderate upward shift.

Across all four classes, the waterfall charts highlight:

  • Bipolar Type-1 and Normal receive positive net support, with Bipolar Type-1 showing the strongest alignment.
  • Bipolar Type-2 and Depression predictions decline from their baselines, indicating weaker support relative to other classes.
  • Emotional and cognitive features such as Mood Swing, Sadness, Suicidal Thoughts, Concentration issues, and coping styles consistently emerge as major contributors, though with different directions depending on the class.
  • The model distinguishes classes by the pattern of contributions rather than reliance on any single feature.

These charts provide a transparent view of how individual feature values shape the model’s decision structure for each class.
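The class-specific outputs are easiest to compare once they are placed on a common probability scale. The sketch below assumes the model was fitted with a softmax-type multiclass objective, so that the per-class f(x) values are raw scores that a softmax turns into class probabilities. This is an assumption about the fitted engine, and the scores shown are hypothetical.

```r
# Softmax: convert per-class raw scores into probabilities that sum to one
softmax <- function(z) {
  e <- exp(z - max(z))   # subtracting the maximum avoids overflow
  e / sum(e)
}

# Hypothetical raw scores for one observation (not the values in Figure 12)
margins <- c(class_1 = 2.0, class_2 = 1.0, class_3 = -0.5, class_4 = 1.2)

probs <- softmax(margins)
round(probs, 3)
names(which.max(probs))  # class with the largest raw score
```

Because the softmax preserves the ordering of the scores, the class with the highest f(x) is also the predicted class, and a class whose score falls below its baseline can still lose only relative to the others.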

Sensitivity & Robustness Analysis
Code
# Repeated stratified k-fold CV
set.seed(123)
repeated_folds <- vfold_cv(
  train_data,
  v = 5,
  repeats = 5,
  strata = Expert.Diagnose
)
Code
#K Fold Cross Validation for Robustness

cv_results <- map_dfr(repeated_folds$splits, function(split) {
  
  # Fit workflow
  fit_wf <- fit(final_xgb, data = analysis(split))
  
  # Predict class
  pred_class <- predict(fit_wf, new_data = assessment(split))
  
  # Predict probabilities
  pred_prob <- predict(fit_wf, new_data = assessment(split), type = "prob")
  
  # Truth
  truth <- assessment(split) %>% select(Expert.Diagnose)
  
  # Combine predictions
  preds <- bind_cols(pred_class, pred_prob, truth)
  
  # Only probability columns (exclude .pred_class)
  prob_cols <- colnames(preds)[grepl("^\\.pred_", colnames(preds)) & colnames(preds) != ".pred_class"]
  
  tibble(
    accuracy = accuracy(preds, truth = Expert.Diagnose, estimate = .pred_class)$.estimate,
    roc_auc  = roc_auc(preds, truth = Expert.Diagnose, !!!syms(prob_cols), estimator = "macro_weighted")$.estimate
  )
})

cv_results %>%
  summarise(
    Mean_Accuracy = mean(accuracy),
    SD_Accuracy   = sd(accuracy),
    Mean_ROC      = mean(roc_auc),
    SD_ROC        = sd(roc_auc)
  ) %>%
  knitr::kable(
    format = "markdown",
    caption = "Cross-Validation Performance Summary",
    digits = 3
  )
Cross-Validation Performance Summary

| Mean_Accuracy | SD_Accuracy | Mean_ROC | SD_ROC |
|---|---|---|---|
| 0.811 | 0.088 | 0.961 | 0.032 |

Table 6. Cross-validation performance summary

The sensitivity analysis results in Table 6 indicate that the model performs consistently across 5 repeats of 5-fold stratified cross-validation (25 resamples). The metrics summarize how stable the model’s performance remains when the composition of the training and assessment data varies.

  • Mean Accuracy: 0.811
  • SD of Accuracy: 0.088

A mean accuracy of 81.1% reflects strong overall predictive performance across iterations.
The standard deviation of 0.088 suggests moderate variability, meaning that although the model generally performs well, accuracy fluctuates somewhat depending on sample composition. This level of variation is expected with small assessment sets and still indicates acceptable stability.

  • Mean ROC AUC: 0.961
  • SD of ROC AUC: 0.032

A mean ROC AUC of 0.961 demonstrates excellent discriminative ability, indicating that the model consistently ranks the correct class above the others.
The relatively low standard deviation of 0.032 shows that this discriminative performance remains highly stable, even when the data varies. The model’s ranking ability (captured by AUC) is more robust than its classification accuracy, which is typical and expected.
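The difference between ranking ability and classification accuracy can be seen in a small example. The sketch below computes a one-vs-all AUC for each class from ranks and averages the results with class-prevalence weights, which is the idea behind a macro-weighted AUC. The helpers binary_auc and macro_weighted_auc and the toy probabilities are hypothetical and are not the yardstick implementation.

```r
# One-vs-all AUC from ranks: probability that a random positive outranks a random negative
binary_auc <- function(score, is_pos) {
  r     <- rank(score)   # average ranks handle ties
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Prevalence-weighted average of the per-class one-vs-all AUCs
macro_weighted_auc <- function(prob, truth) {
  classes <- colnames(prob)
  aucs <- sapply(classes, function(k) binary_auc(prob[, k], truth == k))
  w    <- as.numeric(table(factor(truth, levels = classes))) / length(truth)
  sum(aucs * w)
}

# Hypothetical predicted probabilities for six observations and three classes
truth <- c("a", "a", "b", "b", "c", "c")
prob <- matrix(
  c(0.7, 0.2, 0.1,
    0.4, 0.5, 0.1,
    0.2, 0.6, 0.2,
    0.3, 0.3, 0.4,
    0.1, 0.2, 0.7,
    0.2, 0.2, 0.6),
  ncol = 3, byrow = TRUE, dimnames = list(NULL, c("a", "b", "c"))
)

pred     <- colnames(prob)[max.col(prob, ties.method = "first")]
accuracy <- mean(pred == truth)
auc      <- macro_weighted_auc(prob, truth)
c(accuracy = accuracy, auc = auc)
```

In this toy example two of six observations are misclassified, yet the true class still tends to receive a higher score than the wrong classes, so the AUC stays high while accuracy is only moderate.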

These results show that:

  • The model has strong overall performance, especially in terms of ROC AUC.
  • Variability across repeats is low to moderate, supporting the model’s robustness.
  • The model’s ability to rank or distinguish classes (ROC AUC) is especially stable, while accuracy shows slightly more sensitivity to data shifts.

In summary, the sensitivity and robustness metrics indicate that the model behaves consistently across perturbations, with high performance and acceptable variability. This supports confidence in the model’s generalizability and stability for practical use (Géron 2019).

Summary of Modeling Results

The XGBoost-based classification model demonstrated strong predictive performance, robust generalization, and meaningful interpretability when applied to the mental health diagnostic dataset. Despite the relatively small sample size (120 observations), the model effectively captured nonlinear relationships among the 17 clinical and behavioral features.

The dataset was split into an 80/20 stratified training–testing partition, with 94 observations used for model training and 26 held out for independent testing. Hyperparameter optimization was performed using a 5-fold stratified cross-validation procedure to preserve class distribution during training.

A 30-point space-filling hyperparameter grid was used to efficiently explore the parameter space while preventing excessive computational cost. The grid spanned key parameters including the number of trees, tree depth, learning rate, loss reduction, minimum node size, subsampling proportion, and number of predictors considered at each split.

The optimal hyperparameters, identified using the select_best() function based on ROC AUC, were:

  • mtry: 3
  • trees: 627
  • min_n: 2
  • tree_depth: 2
  • learn_rate: 0.165
  • loss_reduction: 0.0034
  • sample_size: 0.5655

This configuration reflects a model with relatively shallow trees, moderate learning rate, and moderate subsampling—balancing flexibility and regularization.

On the held-out test set, the final model achieved:

  • Accuracy: 0.731
  • Kappa: 0.639
  • Macro-weighted ROC AUC: 0.935
  • Precision: 0.760
  • Recall: 0.726
  • F1-Score: 0.733

The high ROC AUC indicates excellent discriminative ability, while the moderate accuracy, kappa, precision, recall, and F1 scores reflect the inherent difficulty of separating four mental health classes with overlapping symptom profiles.

Further resampling-based sensitivity analysis confirmed that the model’s performance is stable:

  • Mean Accuracy: 0.811
  • SD Accuracy: 0.088
  • Mean ROC AUC: 0.961
  • SD ROC AUC: 0.032

These results show strong robustness, with especially low variation in AUC values. The model’s ranking ability remains highly consistent even under data perturbations.

XGBoost’s feature importance metrics and SHAP analyses highlighted meaningful and clinically relevant predictors. The top features contributing to model performance were:

  1. Mood Swing
  2. Sadness
  3. Aggressive Response
  4. Exhausted
  5. Euphoric

SHAP summary and waterfall plots revealed distinct contribution patterns across diagnostic categories. Emotional states (e.g., mood swings, sadness), cognitive symptoms (e.g., concentration issues, overthinking), and coping behaviors (e.g., ignoring problems, attempts at explanation) contributed differently across classes, indicating that the model relies on nuanced multidimensional patterns rather than single-feature drivers.

On the independent 26-case test set, the model agreed with the expert diagnoses in 19 cases (73.1%):

  • Bipolar Type-1: 4/6 cases correctly predicted (66.7%)
  • Bipolar Type-2: 5/7 cases correctly predicted (71.4%)
  • Depression: 6/7 cases correctly predicted (85.7%)
  • Normal: 4/6 cases correctly predicted (66.7%)

The model successfully identified the majority of cases in each diagnostic category, demonstrating strong generalization to unseen data.

The final XGBoost model delivered:

  • High discriminative performance, with ROC AUC > 0.90 across cross-validation and sensitivity analyses
  • Stable, robust generalization, supported by low performance variability
  • Interpretable, clinically relevant insights, enabled by SHAP and feature importance
  • Accurate classification, with strong performance across all four diagnostic groups on unseen data

Overall, XGBoost proved to be an effective and interpretable modeling approach for mental health classification in a small structured dataset, delivering both predictive value and meaningful interpretability essential for clinical research.

Conclusion

The purpose of this study was to evaluate the effectiveness of an XGBoost model in predicting mental-health outcomes using a survey-based dataset. By developing, tuning, and validating the model, we examined its ability to identify influential predictive factors and to capture non-linear relationships within mental-health data. Overall, the results indicate that XGBoost performed robustly, demonstrating consistent accuracy across cross-validation folds and strong generalizability relative to baseline methods.

The analysis highlighted several key predictors, particularly mood swings, sadness, and suicidal thoughts, together with behavioral indicators such as aggressive response and exhaustion, as the most influential contributors to model performance. These variables emerged as top-ranking features in the importance and SHAP analyses, suggesting their relevance as potential indicators for distinguishing the diagnostic classes. The model effectively distinguished between outcome classes, and the stability of performance across tuning configurations indicates limited susceptibility to overfitting.

These findings suggest that machine-learning models such as XGBoost could support early mental-health monitoring efforts, particularly in clinical or screening contexts. A predictive model capable of identifying at-risk individuals through survey responses may augment traditional diagnostic processes. Although self-reported data may introduce bias, the model’s outputs could still serve as a useful complement to clinician judgment when applied with appropriate caution and follow-up procedures.

This study demonstrates that boosted-tree models provide a strong methodological option for mental-health prediction tasks. XGBoost effectively managed data limitations, captured non-linear relationships, and generated interpretable feature importance measures that offer insights into key predictors. Nonetheless, several limitations must be acknowledged. The relatively small dataset may constrain the generalizability of results; class imbalance could influence prediction behavior; and, as with all correlational modeling approaches, the findings do not establish causal relationships.

Future work could expand on this analysis in several ways. Incorporating larger and more diverse datasets would enable validation across broader populations. Integrating behavioral or longitudinal variables may further enhance predictive performance. Comparative evaluations using alternative models, such as Random Forests, LightGBM, or neural networks, would help determine whether the observed performance is model-specific or consistent across algorithms. In addition, real-world implementation studies are needed to evaluate practical utility and reliability. Ethical considerations, including fairness across demographic groups and responsible handling of personal health data, should remain central to any continued development.

Overall, this study suggests that XGBoost can be an effective tool for analyzing survey-based mental-health data. Machine-learning approaches may provide timely insights that complement traditional assessment methods and support early identification of individuals who may benefit from targeted intervention. With further refinement and validation, predictive modeling has the potential to become an increasingly valuable component of mental-health research and practice.

References

Chahar, Ravita, Ashutosh Kumar Dubey, and Sushil Kumar Narang. 2024. “Multiclass Classification of Mental Health Disorders Using XGBoost-HOA Algorithm.” SN Computer Science 5: 1167. https://doi.org/10.1007/s42979-024-03525-6.
Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2939672.2939785.
Fang, Zhiyuan, Shuang Yang, Chun Lv, et al. 2022. “Application of a Data-Driven XGBoost Model for the Prediction of COVID-19 in the USA: A Time-Series Study.” BMJ Open 12 (7): e056685. https://doi.org/10.1136/bmjopen-2021-056685.
Fomunyam, R. A. 2023. “The Impact of the U.S. Macroeconomic Variables on the CBOE VIX Index.” Journal of Economics and Finance 47 (1): 77–94. https://www.proquest.com/docview/2642416749.
Géron, Aurélien. 2019. Hands‑on Machine Learning with Scikit‑learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. 2nd ed. Sebastopol, CA: O’Reilly Media, Inc.
Hakkal, Soukaina, and Ayoub Ait Lahcen. 2024. “XGBoost to Enhance Learner Performance Prediction.” Computers and Education: Artificial Intelligence 7: 100254. https://doi.org/10.1016/j.caeai.2024.100254.
Hu, Ting, and Ting Song. 2019. “Research on XGBoost Academic Forecasting and Analysis Modelling.” Journal of Physics: Conference Series 1324 (1): 012091. https://doi.org/10.1088/1742-6596/1324/1/012091.
Lee, Min Seok, Seok Woo Yang, and Hong Joo Lee. 2022. “Weight Attention Layer-Based Document Classification Incorporating Information Gain.” Expert Systems 39 (1): e12833. https://doi.org/10.1111/exsy.12833.
Li, H., Y. Cao, S. Li, J. Zhao, and Y. Sun. 2020. “XGBoost Model and Its Application to Personal Credit Evaluation.” IEEE Intelligent Systems 35 (3): 52–61. https://doi.org/10.1109/MIS.2020.2972533.
Liew, Xin Yu, Nazia Hameed, and Jeremie Clos. 2021. “An Investigation of XGBoost-Based Algorithm for Breast Cancer Classification.” Machine Learning with Applications 6: 100154. https://doi.org/10.1016/j.mlwa.2021.100154.
Nikolaidis, P. T., Beat Knechtle, and other co-authors. 2023. “Analysis of the 10-Day Ultra-Marathon Using a Predictive XGBoost Model.” Open Sports Sciences Journal 16. https://uwf-flvc.primo.exlibrisgroup.com/discovery/fulldisplay?docid=cdi_doaj_primary_oai_doaj_org_article_986cc6e5973948ed919ab7ac5176113a.
Ren, X., Y. Zhang, H. Wang, Y. Li, and J. Zhao. 2023. “Strength Prediction and Optimization for Ultrahigh-Performance Concrete with Low-Carbon Cementitious Materials – XGBoost Model and Experimental Validation.” Construction and Building Materials 399: 134208. https://doi.org/10.1016/j.conbuildmat.2023.134208.
Saleh, M., E. Amona, M. Kuttikat, I. Sahoo, D. Chan, J. Murphy, and M. Lund. 2024. “Child Mental Health Predictors Among Camp Tamil Refugees: Utilizing Linear and XGBOOST Models.” PLoS ONE 19 (9): e0303632. https://doi.org/10.1371/journal.pone.0303632.
Sharma, Anjali, and Wouter J. M. I. Verbeke. 2020. “Improving Diagnosis of Depression with XGBOOST Machine Learning Model and a Large Biomarkers Dutch Dataset (n = 11,081).” Frontiers in Big Data 3: 15. https://doi.org/10.3389/fdata.2020.00015.
Sivakumar, R., and S. Elangovan. 2023. “Prediction of Seasonal Infectious Diseases Based on Hybrid Machine Learning Approach.” International Journal of Health Sciences 7 (2): 1958–69. https://research.ebsco.com/c/imx7og/viewer/pdf/ff4a3en7vb.
Su, Wenjie, Fei Jiang, Chen Shi, Dapeng Wu, Lihua Liu, Shu Li, Ying Yuan, and Jie Shi. 2023. “An XGBoost-Based Knowledge Tracing Model.” International Journal of Computational Intelligence Systems. https://doi.org/10.1007/s44196-023-00192-y.
Wiens, Mark, April Verone-Boyle, Nate Henscheid, J. T. Podichetty, and John Burton. 2025. “A Tutorial and Use Case Example of the eXtreme Gradient Boosting (XGBoost) Artificial Intelligence Algorithm for Drug Development Applications.” Clinical and Translational Science 18: e70172. https://doi.org/10.1111/cts.70172.
Wu, Bo, Liangpei Zhang, and Yindi Zhao. 2014. “Feature Selection via Cramer’s V‑Test Discretization for Remote‑Sensing Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 52 (5): 2593–2606. https://doi.org/10.1109/TGRS.2013.2276497.
Xu, Xiao-Ming, Yang S. Liu, Su Hong, Chuan Liu, Jun Cao, Xiao-Rong Chen, Zhen Lv, et al. 2024. “The Prediction of Self-Harm Behaviors in Young Adults with Multi-Modal Data: An XGBoost Approach.” Journal of Affective Disorders Reports 16: 100723. https://doi.org/10.1016/j.jadr.2024.100723.
Zhang, Ping, Yibing Jia, and Yanan Shang. 2022. “Research and Application of XGBoost in Imbalanced Data.” International Journal of Distributed Sensor Networks 18 (6). https://doi.org/10.1177/15501329221106935.